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ESTIMATING SUFFICIENT REDUCTIONS OF THE PREDICTORS 
IN ABUNDANT HIGH-DIMENSIONAL REGRESSIONS 

By R. Dennis Cook 1 , Liliana Forzani and Adam J. Rothman 2 

University of Minnesota, Instituto de Matemdtica Aplicada del Litoral 
and University of Minnesota 

We study the asymptotic behavior of a class of methods for suffi- 
cient dimension reduction in high-dimension regressions, as the sam- 
ple size and number of predictors grow in various alignments. It 
is demonstrated that these methods are consistent in a variety of 
settings, particularly in abundant regressions where most predictors 
contribute some information on the response, and oracle rates are 
possible. Simulation results are presented to support the theoretical 
conclusion. 

1. Introduction. There are many facets to the analysis of data in high 
dimensions, depending on the type of application, some relying on dimen- 
sion reduction, others relying on variable selection and a few employing both 
tactics. There has been considerable interest in dimension-reduction meth- 
ods for the regression of a real response Y on a random vector of predictors 
X£l ? since the introduction of sliced inverse regression [SIR; Li (1991)] 
and sliced average variance estimation [SAVE; Cook and Weisberg (1991)]. 
A common goal of these and many other methods is to reduce the dimen- 
sion of the predictor vector without loss of information about the response. 
The aim is to estimate a reduction R:IR P — > M. d , d <p, with the property 
that YiLX|R(X) or, equivalently, X|(Y,R(X)) ~ X|R(X) [Cook (2007)]. 
In this way R is sufficient because it captures all the information about Y 
that is available from X. Sufficient reductions are not determined uniquely 
by this definition because any bijective transformation of R is also sufficient. 

Nearly all methods for sufficient dimension reduction (SDR) restrict at- 
tention to the class of linear reductions, which arise naturally in many con- 
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texts. Linear reduction can be represented conveniently in terms of the pro- 
jection P5X of X onto a subspace S C M. p . If Y JL X|PsX, then 5 is called 
a dimension-reduction subspace. Under mild conditions the intersection of 
any two dimension-reduction subspaces is again a dimension-reduction sub- 
space and that being so the central subspace Sy\x, defined as the intersec- 
tion of all dimension-reduction subspaces, is taken as the inferential target 
[Cook (1994, 1998)]. A minimal sufficient linear reduction is then of the form 
R(X) = r) T ~K, where r\ is any basis for 5y|x- 

SDR has a long history of successful application and is still an active 
research area. Recent novel SDR methods include likelihood-based sufficient 
dimension reduction [Cook and Forzani (2009)], kernel dimension reduction 
[Fukumizu, Bach and Jordan (2009)], shrinkage inverse regression estimation 
[Bondell and Li (2009)], dimension reduction for nonelliptically distributed 
predictors [Li and Dong (2009), Dong and Li (2010)], cumulative slicing 
estimation [Zhu, Zhu and Feng (2010)], dimension reduction for survival 
models [Xia, Zhang and Xu (2010)] and dimension reduction for spatial point 
processes [Guan and Wang (2010)]. This body of work reflects three different 
but related frontiers in SDR: extensions that require progressively fewer 
assumptions, development of likelihood-based methods and adaptations for 
specific areas of application. Almost all SDR methods rely on traditional 
asymptotic reasoning for support, letting the sample size n — > 00 with p 
fixed. They nearly all require the inverse of a p x p sample covariance matrix 
and thus application is problematic when n < p. Since accurate estimation 
of a general p x p covariance matrix can require n> p observations, it has 
seemed inevitable that SDR methods would encounter estimation problems 
when n is not sufficiently large. Chiaromonte and Martinelli (2002), Li and Li 
(2004) and others circumvented these issues by performing reduction in two 
stages, first replacing the p predictors with p*<n principal components 
and then applying an SDR method to the regression of the response on 
the selected p* components. However, recent results on the eigenvectors of 
sample covariance matrices in high-dimensional settings raise questions on 
the value of such two-stage methods [see, e.g., Johnstone and Lu (2009)]. 
Cook, Li and Chiaromonte (2007) proposed an SDR method that avoids 
computation of inverses and reduces to partial least squares in a special 
case. Chun and Keles (2010) showed recently that the partial least squares 
estimator of the coefficient vector in the linear regression of Y on X is 
inconsistent unless p/n— > and this raises questions about the behavior of 
the Cook et al. SDR estimator when n is not large relative to p. Li and 
Yin (2008) used the least squares formulation of sliced inverse regression 
originated by Cook (2004) to develop a regularized version that allows n <p 
and achieves simultaneous predictor selection and dimension reduction. This 
seems to be a promising method, but its asymptotic properties are unknown 
and it may not work well when the regression is not sparse. Wu and Li 
(2011) studied the asymptotic properties of a family of SDR estimators, 
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using a SCAD-type penalty for variable selection in sparse regressions where 
the number of relevant variables is fixed as p — > oo. Their method requires 
that p/n — > for consistency. 

While sparsity is an important concept in high-dimensional regression, 
not all high-dimensional regressions are sparse. For example, near-infrared 
reflectance is often measured at many wavelengths to predict the compo- 
sition of matter, like the protein content of a grain. There is not normally 
an expectation that only a few wavelengths are needed to predict content. 
While some wavelengths may be better predictors than others, it is the cu- 
mulative information provided by many wavelengths that is often relevant. 
Partial least squares has been the dimension-reduction method of choice in 
this type of regression. The regressions implied by this and other nonsparse 
applications share similar characteristics: (1) the predictor vectors are high- 
dimensional and typical area-specific analyses have employed some type of 
dimension reduction; (2) while assessing the relative importance of the pre- 
dictors may be of interest, prediction is the ultimate goal; (3) information on 
the response is thought to accumulate, albeit perhaps slowly, as predictors 
are added, and (4) sparsity is not a driving notion. 

In this article we introduce a family of SDR methods for studying high- 
dimensional regressions that differs from past approaches in at least three 
important ways. First, we do not require sparsity but rather we empha- 
size abundant regressions where most of the predictors contribute some in- 
formation about the response. In the logic of Friedman et al. (2004), the 
bet-on-sparsity principle arose because, to continue the metaphor, there is 
otherwise little chance of a reasonable payoff. We show in contrast that rea- 
sonable payoffs can be obtained in abundant regressions with prediction as 
the ultimate goal, leading to a contrasting bet- on- abundance principle. Sec- 
ond, SDR studies have largely focused on properties of estimators of 5y|x- 
We bypass this step and instead consider the limiting behavior of estima- 
tors of the sufficient reduction R(X) itself, assuming that the dimension d 
of R is fixed. More specifically, letting R denote an estimated reduction, we 
establish rates of convergence in the following sense. Let X^v denote a new 
observation on X. If R(Xtv) — R(Xjv) = O p (r(n,p)) and if r(n,p) — > as 
n,p— > oo, then R(Xat) is consistent for R(Xtv) and its convergence rate 
is at least r . Third, we integrate recent work on the estimation of high- 
dimensional covariance matrices into our approach. In particular, we es- 
timate a critical matrix of weights by using sparse permutation invariant 
covariance estimation (SPICE) as developed by Rothman et al. (2008). 

In sum, by considering the reduction R itself rather than the central sub- 
space 5y|x, we both introduce a novel viewpoint for addressing dimension 
reduction and develop theoretically grounded SDR methodology for n < p 
regressions where other methods either have no asymptotic support or must 
necessarily fail. 
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We describe the model for our study and its sufficient reduction in Sec- 
tion 2. The class of estimators that we use is described in Section 3, and 
stabilizing restrictions are presented in Section 4. Sections 5 and 6 contain 
theoretical conclusions for selected estimators from the class described in 
Section 3. Simulation results are presented in Sections 5.2 and 7. We turn 
to a spectroscopy application in Section 8 and a concluding discussion is 
given in Section 9. All proofs and additional simulation results are available 
in a supplemental article [Cook, Forzani and Rothman (2012)]. 

The following notational conventions will be used in our exposition. We 
use M. pxq to denote the collection of all real p x q matrices. We use ||A|| 
and ||A||j7 to denote the spectral and Frobenius norms of A. The largest 
and smallest eigenvalues of A G M pxp are denoted ip ma , x (A) and <£>mi n (A). 
If A G W pxp , then diag(A) G W xp is the diagonal matrix with diagonal el- 
ements equal to those of A. vec(A) is the operator that maps A G M. pxq 
to M. pq by stacking its columns. If B G W xq and A G M. pxp is symmetric and 
positive definite, then the operator that projects in the A inner product 
onto span(B), the subspace spanned by the columns of B, has the matrix 
representation Pb(a) = B(B T AB)~ 1 B T A, and Qb(a) = Ip — Pb(A)- Pb 
indicates the projection onto span(B) in the usual inner product. A basis 
matrix for a subspace S C MP of dimension d is any matrix B 6 l px<i whose 
columns form a basis for S. For nonstochastic sequences {a n } and {fe n }, we 
write a n >c b n if there are constants m, M and iV such that < m < \a n /b n \ < 
M < oo for all n > N. Similarly, for stochastic sequences {a n } and {b n }, we 
write a n x p b n if a n = O p (b n ) and b n = O p (a n ). For A, B G R pxp , A > B 
means that A — B is positive definite. <S> denotes the Kronecker product, 
and X ~ Y means X and Y are equal in distribution. Deviating slightly 
from convention, we do not index quantities by n and p, preferring instead 
to avoid notation proliferation by giving reminders from time to time. 

2. Model. We assume throughout that the data consist of independent 
observations (l^,Xj), i = 1, . . . ,n, and that p is increased by taking addi- 
tional measurements on each of n sampling units. We treat the process of 
adding predictors as conditional on the response and so our approach is 
based on inverse reduction, X|(Y,R(X)) ~ X|R(X), which is a standard 
SDR paradigm. Limits as n,p— > oo are thus conditional on the responses 
unless specifically indicated otherwise. 

2.1. Inverse regression model. The model that we employ is engendered 
by standard SDR assumptions. We first review those assumptions briefly 
and then give a statement of the model. 

Classical methods like SIR require two predominant and well-known con- 
ditions for estimation of 5y|x- The first, called the linearity condition, in- 
sists that E(X|?7 T X) be a linear function of r] T ~K, where r] is a basis matrix 
for 5y|x- It must be valid only at a true basis and not for all rj G M pxd . This 
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condition is generally regarded as mild since it holds to a good approxima- 
tion when p is large [Hall and Li (1993)]. Let Se(x.\y) denote the subspace 
spanned by E(X|Y = y) — E(X) as y varies in the sample space of Y . Then 
the linearity condition implies that <Se(x|y) Q var (X)5y|X) which is used 
as a basis for estimating a subspace of Syix- The second, called the cover- 
age condition, requires that <S#(x|Y) = var (X)5y|x an d it too is generally 
regarded as mild in many applications. While the linearity and coverage con- 
ditions are standard requirements for root-n consistent methods based on 
the inverse mean function E(X| Y"), the actual performance of those methods 
depends also on the inverse variance function var(X|Y). For instance, Bura 
and Cook (2001) concluded based on simulations that SIR works best when 
var(X|y) is nonstochastic, and Cook and Ni (2005) demonstrated analyti- 
cally that SIR can be quite inefficient when var(X|Y) varies. We refer to the 
requirement that var(X|Y) be nonstochastic as the covariance condition. If 
the linearity and covariance conditions hold, then E(var(X|?7 r X)|Y) must 
be nonstochastic as well. This is related to the usual covariance condition — 
var(X|77 T X) is constant — used by SAVE and other second-order methods. 
See Li (1991), Cook and Ni (2005) and Li and Dong (2009) for further dis- 
cussion of these conditions. 

We assume the linearity, coverage and covariance conditions as a basis for 
our study. Under these conditions it can be shown straightforwardly that 
var(X)5y| X = A<Sy| X , where A = var(X|Y). Consequently, letting f G W xd 
be a basis matrix for A5y|x, we are led to the model 

(2.1) X i = ^ + f| i + e i , i = l,...,n, 

where n = E(X), the error vectors Si are independent copies of a random 
vector e6K p with mean and variance var(e) = A, and ^ = £(Y) is the ith 
instance of an unknown vector- valued function $(Y) € M. d with E(£) = that 
gives the coordinates of E(X|Y) — E(X) in terms of T and is independent 
of e. In this representation 5y|x = A -1 span(r). 

Neither T nor £ is identified in model (2.1), because for any conforming 
nonsingular matrix A, T£ = (I 1 A)(A~ 1 ^), leading to a different parameter- 
ization. This nonuniqueness has been mitigated in past studies with p fixed 
by requiring r T r = 1^. However, that parameterization is not workable when 
allowing p— > oo, and for this reason we adopt different restrictions. 

The next step is to reparameterize model (2.1) to satisfy constraints 
that will facilitate our development. Specifically, we construct an equiva- 
lent model 

(2.2) X* = + + i = l,...,n, 

where span(T) = span(r), £j = £(Y) is the coordinate vector relative to the 
new basis matrix T, and we center the £j's in the sample so that £ = 0. 
Let the rows of S S R nxd and 3 £ R nxd be |f and £[ , i = 1, . . . , n. With- 
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out loss of generality, we impose on model (2.2) the constraints that (1) 
rt _1 H M n E = 1^, where M n is a scaling matrix that is defined in Sec- 
tion 4.1, and (2) r T WT is a diagonal matrix where W > is a symmetric 
p x p population weight matrix that is discussed in Section 3. To see how 
this is done starting from model (2.1), let T = (n~ 1 3 T M n H)~ 1 / 2 and let 
O G R dxd be an orthogonal matrix so that T T 1 f T Wf T _1 is a diag- 
onal matrix. Then Sf T = HTOO T T- 1 f T = Hr T , where S = 3TO and 
T T = T T _1 r T satisfy the constraints by construction. Model (2.2) can be 
represented also in matrix form as 

(2.3) X = l n /z T + 3T T + e, 

where X G M. nxp and e G M nxp have rows X?" and ej, and e has mean and 
variance var(vec(e T )) =I„® A. 

Since bijective transformations of a sufficient reduction are themselves 
sufficient, we define R to be the coordinates of the projection of X — fi onto 
span(r) in the A" 1 inner product: 

(2.4) R(x) = (r r A" 1 r)- 1 r r A- 1 (x-/x), 

where the first factor (r T A _1 r)" 1 stabilizes E(R) as p — > oo. 

3. Estimation. Without further structure it is not possible to use mod- 
el (2.2) to estimate R(X) since the coordinate vectors £j are unknown. 
However, estimation is possible by approximating the coordinate vectors 
as £j ~ bfj, where b G R rfxr , d < r, and fj = f(l^) is the ith. realization 
of a known user-selected vector- valued function f(Y) G R r . Without loss 
of generality we center the sample ^17=1 ^ = ^ ^ I^ nxr be the matrix 

with rows f?\ and assume that $ n = ¥ T ¥/n > and $ = lim n $ n > 0, unless 
r = and then of course F is nil. We next describe the class of estimators 
we use, postponing discussion of possible choices for f until Section 3.2. 

3.1. Estimators. The class of reduction estimators R that we study is 
based on using estimators of (/i, b, T) from a subclass of the family of inverse 
regression estimators proposed by Cook and Ni (2005): 

(£,b, f ) = argmintr{(X - l nt x T - Fb T T r )W(X - \ ni i T - Fb r r T ) T }, 

where the minimization is over /x G W, T G M. pxd and b G M. dxr subject to 
the constraints that b$ n b T = 1^ and that r T Wr is a diagonal matrix. 
A particular estimator is determined by the choice of the sample weight 
matrix W G M. pxp with W being a population version. Some instances of W 
that we introduce later correspond to W = I p , A~ x and diag _1 (A). We 
next report the estimators from this family. A sketch of the derivation is 
available in the supplemental article [Cook, Forzani and Rothman (2012)]. 
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It is easily seen that p, = X. Let Z = X- l n £ T , and let B = Z T F(F T F)^ 1 £ 
M pxr denote the matrix of regression coefficients from the least squares fit 
of X on f , assuming that n > r + 1. Also, let the columns of £ R rxd be 
the first d eigenvectors of 

(3.1) K = *y 2 B T WB* r Y 2 £ R rxr , 

assuming that the dth eigenvalue of K is strictly larger than the (d+ l)st 
eigenvalue. This assumption will be true with probability 1 for the choices 
of W considered here. Then the estimators are b = Vj$ n 1 ' 2 , T = B<&^ V^, 
f T wf = VjKV d and 

(3.2) R(X) = (f T wf)" 1 f T W(X-X). 

The diagonal elements of the diagonal matrix T T WT consist of the first 
(largest) d eigenvalues of K, and b is the coefficient matrix from the OLS 
fit of R(Xi) on fi. 

3.2. Choice off. Clearly we should strive to choose an f(Y) so that, for 
some b £ R rfxr , ^ is well approximated by bf; for i = 1, . . . , n. This intuition 
is manifested in various calculations via the requirement that rank(?i~ 1 H T F) = 
d for all n. As an extreme first instance, suppose that H T F = 0. Then B = 
Z T F(F T F)- 1 = (T3 T F + e T F)(F T F)- x = e T (F T F)~ 1 and consequently B 
provides no information on span(T). Since n _1 H T F = cdv(£(Y), f (Y)), this 
requirement is essentially that the true coordinate vector ^ is sufficiently 
correlated with its approximation, and it is equivalent to the condition de- 
rived by Cook and Forzani (2008) under a PFC model with normal errors 
and p fixed. In the remainder of this article we assume that, for all n, 

(3.3) rank(n _1 3 T F) = d and n _1 H T F = 0(l) asn^oo, 

where the order is nonstochastic because we condition on the observed values 
of Y in our formal asymptotic calculations. 

There are many ways to choose an appropriate f in practice. Under 
model (2.2), each coordinate Xj, j = 1, . . . ,p, of X follows a univariate linear 
model with predictor vector i(Y). When Y is quantitative we can use in- 
verse response plots [Cook (1998), Chapter 10] of Xj versus Y, j = 1, . . . ,p, 
to gain information about suitable choices for f . When p is too large for 
a thorough graphical investigation, which is the case in the context of this 
article, we can consider f 's that contain a reasonably flexible set of basis 
functions, like polynomial terms in Y. In some regressions there may be 
a natural choice for f . For example, suppose that Y is categorical, taking 
values in one of m categories k = 1, . . . , m. We can then set r = m — 1 
and specify the kth coordinate of f to be the indicator function J(y £ C&). 
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Another option consists of "slicing" the observed range of a continuous Y 
into m bins (categories) C/%, k = 1, . . . ,m. A rudimentary set of basis func- 
tions can then be constructed by specifying the A;th coordinate of f as for 
the case of a categorical Y. This has the effect of approximating each con- 
ditional mean E(X,-|Y) as a step function of Y with m steps. This option 
is of particular interest because it exactly reproduces SIRs estimate of 5y|x 
in the traditional large- ra setting [Cook and Forzani (2008)]. Cubic splines 
with the endpoints of the bins as the knots are a more responsive option 
that is less prone to loss of intra-slice information. 

4. Universal context. Our goal is to study the limiting behavior of 
R(Xtv) — R(Xjv), where R and R are given by (2.4) and (3.2). This differ- 
ence depends on the choice of W and the behavior of the true reduction R as 
p — > oo. In this section we measure and constrain the interaction between W 
and the model. We also place weak constraints on R to help ensure well- 
behaved limits. The context described in this section will be assumed to hold 
throughout this article, without necessarily being declared in formal state- 
ments. We will revisit these constraints occasionally during the discussion, 
particularly when some of them are implied by other structure. 

4.1. Scaling matrix M n . Since bijective transformations of sufficient re- 
ductions are also sufficient, we need to have R(Xjv) and R(Xyv) in the 
same scale to ensure that their difference R(Xtv) — R(Xyv) is a useful mea- 
sure of agreement. This can be accomplished by choosing the scaling matrix 
M n = Pf so that the first model constraint stated following (2.2) becomes 
n - 1 3 T M n 3 = n _1 3 T P F H £ R dxd , which is a rank-d matrix by (3.3). This 
choice ensures that T is the lead term in the asymptotic expansion of T, 
which places R(Xjv) and R(Xjy) on the same scale. 

4.2. Signal rate. We define the signal rate to be a positive monotoni- 
cally increasing function h(p) = 0(p) that measures the rate of increase in 
the population signal: let Gh = r^WT /h(p), which is a diagonal matrix 
because r T WT is diagonal by construction. Then h(p) is chosen to meet 
the requirement 

(4.1) lim G h = G e M. dxd , 

p— >oo 

where G is a diagonal matrix that is assumed to be positive definite with 
finite elements. The signal rate is not needed for computation of R but it 
will play a key role in later developments. It depends via W on the specific 
estimator selected, although this is not indicated notationally. When W > 
the bounds tp min (W)T T T < r T WT < 99 max (W)r T r can be used to aid 
intuition on the magnitude of h(p) by presuming properties of T and W. For 
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example, consider regressions in which </? m i n (W) and 9J m ax(W) are bounded 
away from and oo as p — > oo and a positive fraction g, < a < g < 1, of the 
rows of r is sampled from a multivariate density with finite second moments 
and the other rows of T are all zero vectors. Then the number of nonzero 
rows pg x p, h(p) x p, and we say that the regression has an abundant signal. 
Regressions with near abundant signals like h(p) xp 2 / 3 may often perform 
similarly in practice to regressions with abundant signals. It is technically 
possible to have regressions in which p = o(h(p)), but we would not normally 
expect this in practice. On the other extreme, sparse regressions are those 
in which h(p) x 1, so only finitely many predictors are relevant or the signal 
accumulates very slowly as p — > oo. Regressions with near sparse signals 
have, say, h(p) = o(p 1 ^ 3 ). Typically we will use h(p) in definitions and formal 
statements, but use the abbreviated form h otherwise. We assume in the rest 
of this article that (4.1) holds. 

4.3. Limiting reduction. Our second requirement is that, as p— > oo, the 
spectral norm ||var(R)|| converges to a finite constant, which may be 0. If 
this is not so, then the variance of some linear combinations a T R will diverge 
as p — > oo, and the very notion of dimension reduction for large-p regressions 
becomes problematic. 

Definition 4.1. A reduction R(X) is stable if lim p ^ 00 ||var(R(X))|| < 
oo. 

The following lemma gives a sufficient condition for stability that incor- 
porates the signal rate. In preparation, define 

(4.2) p = W^AW 1 / 2 eR pxp , 

which is manifested in various asymptotic expansions as a measure of the 
agreement between W and A -1 . 

Lemma 4.1. If ||p|| = 0(h(p)), then reduction (2.4) is stable. 

According to Lemma 4.1, if W = A" 1 , then the reduction is stable re- 
gardless of h. All regressions in which \\p\\ is bounded yield stable reduc- 
tions. Bounded eigenvalues have been required in various studies to avoid 
ill-conditioned covariance matrices [see, e.g., Bickel and Levina (2008a) and 
Rothman et al. (2008)]. More generally, the requirement for a stable reduc- 
tion is that h(p) must increase at a rate that is no less than the rate at 
which ||p|| increases. In particular, if h = 0(1), then the sufficient condition 
of Lemma 4.1 requires that ||p|| is bounded. 

The rates that we develop depend on the functions K(n,p) = [^/{^(p)^}] 1 / 2 
and ip(n,p, p) = \\p\\F/{h(p)-*/n}. The interpretation and roles of these func- 
tions will be discussed later; for now we state across-the-board requirements 
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that, as n,p — > oo, 

(4.3) K(n,p) = 0(l) and i/)(n,p, p) = O(l). 

These functions will nearly always be written without their arguments. We 
assume in the rest of this article that all regressions are stable and that the 
orders given in (4.3) hold. 

4.4. Joint constraints on the weights W and errors e. So far we have as- 
sumed only that the errors Si are independent copies of the random vector e 
which has mean and variance A. We impose two additional constraints to 
ensure well-behaved limits: as p — > oo, 

(W.l) E(e T We) = 0(p), and 
(W.2) var(e T We) = 0(p 2 ). 

Condition (W.l), which is equivalent to tr(p)/p = 0(1), seems quite mild. 
For example, if we use unweighted fits with W = W = I p , then this con- 
dition is simply that the average error variance tr(A)/p is bounded. Con- 
dition (W.2) can be seen as placing constraints on W and on the fourth 
moments of e. It too seems mild, although it is more constraining than 
the first condition. The following lemma describes a few settings in which 
condition (W.2) holds. 

Lemma 4.2. Let e = (e^) = A~ 1 / 2 e so that E(e) = and var(e) = I p . Let 
4>ij =E{efe 2 j ), i,j = l,...,p. Then: 

(i) Lf e ~ N(0, A), then condition (W.l) implies condition (W.2). 

(ii) // W = A" 1 and if fa < (f>< oo as p —> oo, then condition (W.2) 
holds. 

(iii) If (a) the elements e, of e have symmetric distributions or are in- 
dependent and if (b) fa < (ft < oo as p — > oo, then condition (W.l) implies 
condition (W.2). 

We assume in the rest of this article that conditions (W.l) and (W.2) 
hold. 

5. W converging in spectral norm. In this section we describe our re- 
sults for the limiting reduction when W converges in the spectral norm. 
Specifically, we require the following two conditions: 

(5.1) There exists a population weight matrix W € W xp so that the 
spectral norm of S = W~ 1 / 2 (W — W)W~ 1 / 2 converges to in probability 
at rate at most w _1 (n,p) as n,p— > oo; equivalently, ||S|| = O p (u(n,p)) as 
lo->0. 

(5.2) ||E(S 2 )|| =0(uj 2 (n,p)) asw^O. 



ABUNDANT SUFFICIENT DIMENSION REDUCTION 



11 



(5.2) var(,) = G?<*\ ' " ~ "J » \G~^ E R™, 



These conditions are implied by the stronger condition that E(||S 2 ||) = 
0(oj 2 ), (S.l) following from Chebyshev's inequality and (S.2) arising since 
||E(S 2 )|| < E(||S 2 ||). All of the weight matrices discussed in this article can 
satisfy these conditions, as well as other weight matrices that we have con- 
sidered, so these conditions do not seem burdensome. For ease of reference 
we denote the corresponding estimator as . 

5.1. Theoretical results. We state and discuss one of our main results in 
this section. In preparation, let 

(5.1) K = n- 2 *- 1 / 2 F T SG /l H T F*- 1 / 2 E M rxr , 

where is as defined in (4.1), let p = tr(p)/p and define the random vec- 
tor v = Kw(sn) ~ R4>tv) E where K w (e N ) = (Fwrj^r^Wejv is 
the population reduction using W applied to the error sn of a new obser- 
vation and R(eat) is the targeted population reduction (2.4) applied to the 
same error. The vector v E M. d measures the population-level agreement be- 
tween the user-selected reduction Rw and the ideal reduction. It has mean 
E(f) =0 and variance 

-1/2 / 1 T P1 ~ (7 T P ~ 1 7)~ 1 

Hp) 

where the semi-orthogonal matrix 7 = W 1 / 2 r(r r Wr)~ 1 / 2 . 

The next proposition describes asymptotic properties of K E M rxr and R^y , 
where K was defined in (3.1). 

Proposition 5.1. Assume conditions (S.l) and (S.2). Then: 

(i) J/S = 0, then h~ l (p)E(k) = K + K 2 pI r . 

(ii) h- 1 (p)(K-K) = O p (K) + O p (u). 

(iii) If, in addition, \\p\\ = 0(h(p)) , then 

R^(Xat) - R(Xjv) = v + O p (k) + O p (^) + O p (oj), 
where k and ip were defined in (4-3). 

This proposition shows that the asymptotic properties of K and R^ de- 
pend on four quantities, each with a different role. We defer discussion of the 
order O p {u) that measures the asymptotic behavior of W to later sections. 
In the rest of this section we concentrate on the remaining terms — v, O p (k) 

and O p (ip) — by assuming that W is nonstochastic, so W = W, S = and 

R^. = Rw- This involves no loss since none of these terms depends on S. 

Terms involving k affect both estimation K and prediction Rw(Xtv) — 

R(Xtv). If k were to diverge, then, from Proposition 5.1 (i) , eventually E(K) jh 
will look like a diagonal matrix and there will be little signal left, which is 
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why we imposed the universal condition (4.3) that k = 0(1)- If p diverges, 
then again E(K)//i will resemble a diagonal matrix, but this is prohibited 
by universal condition (W.l). Here we also see the role of the rank condi- 
tion (3.3) introduced at the beginning of Section 3.2. If rank(ra _1 S T F) < d, 
then rank(K) < d and again some signal will lost. In the extreme, if H T F = 0, 
then /i _1 E(K) = n 2 pl r and all information on span(r) is lost. 

If k — > 0, then hn increases faster than p increases. In effect there is 
a synergy between the signal rate and the sample size. If the regression is 
abundant, so/ixp, then k x n™ 1 / 2 . Useful results can also be obtained when 
h xp 2 ' 3 , because then k xp 1 ^ 1 ' 2 , which may be small in some regres- 
sions. In sparse regressions where hxl, (p/n) 1 / 2 , and we are back to 

the usual requirement that p/n — > for consistency of K. Equally impor- 
tant, since k does not depend on the weight matrix, the results indicate that 
it may not be possible to develop from the family of estimators considered 
rates of convergence that are faster than n . 

The v £ R d and O p (ip) terms arise from prediction. The first term u 
has a characteristic that is different from the others because it does not 
depend on n. Since var(i/) < G^HpH/Zi, the sufficient stability requirement 
||p|| = 0(h) of Lemma 4.1 guarantees that var(i^) is bounded as p — > oo. 
While the contribution of v will be negligible if var(i^) is sufficiently small, 
for the best results we should have var(^) — > as p — > oo. Let (7,7o) be an 
orthogonal matrix. Using a result from Cook and Forzani [(2009), eq. (A. 4)], 

7 T P7 - (7 T P~ 1 7)" 1 = 7 T P7o(7o P7o) _1 7(Tp7o- Consequently, var(i/) = 
when span(7) is a reducing subspace of p, even if /ixl. This result is 
similar to Zyskind's (1967) classical findings about conditions for equality 
of the best and simple least squares linear estimators in linear models. If 
W = A -1 , then p = I p and span(7) is trivially a reducing subspace of p. 
If A is a generalized inverse of W, WAW = W, and if T T WT > 0, then 
again span(7) reduces p and var(i^) = 0. 

Turning to O p (^), since ||p||f < a/pIIpH' ^ follows that ip < K,\\p\\/Vh. 
Consequently, if ||p|| = O(Vh), then O p (n) + O p (i/j) = O p (k) and Rw(Xyv) — 
R(Xjv) = v + O p (k). In a worst-case scenario, if ip = K \\p\\/Vh an d ||p||/\//i 
diverges, then O p (k) + O p (ip) = O p (^/p/^ [ /n) because of the requirement 
||p|| = 0(h) in Proposition 5.1(iii), and these terms reduce to the usual con- 
dition that p/n —> 0. 

In sparse regressions, the condition k— > reduces to p/n— > and ||p|| = 
0(h) means that ||p|| must be bounded. These two conditions imply that 
i() — > 0. Consequently, the best results in sparse regressions will be achieved 
when (a) span(7) is a reducing subspace of p, (b) p/n — > and (c) ||p|| is 
bounded. In nonsparse settings Rw will yield the best results when ||p|| is 
bounded and the regression is abundant or near abundant. We summarize 
some implications of these conditions in the following corollary. 
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Corollary 5.1. Assume that \\p\\ =0(1) and that S = 0. Then: 

(i) Rw(Xat) — R(Xjv) = O v (hr x l 2 } + O p (k). If the regression is abun- 
dant, then Rw(Xjv) — R(X^r) = O p {n~ 1 / 2 ). 

(ii) If span(7) is a reducing subspace of p, then Rw(Xjv) — R(Xtv) = 
O p (k). If, in addition, the regression is abundant, then Rw(Xjv) — R(X^r) = 
O p (n-V 2 ). 

The conclusions of Proposition 5.1 and its Corollary 5.1 hold as n — > oo 
and p — > oo in the required relationships and as n — > oo with p fixed. In the 
latter case the results simplify to those given in the next corollary. 

Corollary 5.2. Ifp = 0(l) andS = 0, then: 

(i) Rw(Xjv) - R(X,v) = var(i/) + O^rT 1 / 2 ). 

(ii) If, in addition, span('y) is a reducing subspace of p, then Rw(Xat) — 
R(X A r) = O p (n- 1 /2). 

It may be clear from the previous discussion that the best possible rate 
is achieved when W = W = A" 1 and then Rw(Xjy) — R(Xjv) = O p (k). 
Consequently, we refer to the oracle rate. 

5.2. Simulations. We conducted a simulation study with W = W = L, to 
show the importance of u, to demonstrate that Proposition 5.1 and Corollar- 
ies 5.1 and 5.2 give good qualitative characterizations of the limiting behav- 
ior of Ri(Xjy) — R(Xjy), and to provide some intuition into the nonasymp- 
totic behavior of Rj. The simulated data were generated using a simple 
version of model (2.2): X = ry + e, Y ~ JV(0, 1), T G W, was constructed 
as a vector of standard normal random variables and A is a diagonal ma- 
trix. Specific scenarios may differ on n, p and the choice of A. In any case, 
the true model then has d = 1. We confined attention to regressions with 
d = 1 because we found that there is nothing in principle to be gained from 
settings with d > 1 . For each sample size n we used the estimators described 
at the end of Section 3 with W = I p and Y 3 , j = 1,2,3,4, as the elements 
of f , so p = A and r = 4. 

For each data set constructed in this way, we determined Ri(Xjvj) 
and R(Xatj) at j = 1,...,100 new data points generated from the origi- 
nal model. We summarized each data set by computing the absolute sample 
correlation between Rj and R and the sample standard deviation of the 
difference Ri — R over the 100 new data points. This process was repeated 
at least 400 times for p < 300, 200 times for 300 < p < 800 and 50 times for 
p > 800. The average absolute correlation and average standard deviation 
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Fig. 1. Results of the first simulation described in Section 5.2 with fixed p = 50, W = I p , 
Y ~ N(0, 1) and diag(A) ~ Uniform(l, 101). A: T generated as a vector of N(0, 1) random 
variables. B: Theoretical lower bound on the standard deviation of Ri — R for curve A. 
C: r = (8.2, 0, . . . , 0) T . (a) Standard deviation of Hi— R; (b) correlation (Ri,R). 



were used as overall summary statistics. Our decision to use a diagonal ma- 
trix for A was based on the observation that the asymptotic results depend 
on A largely via ||p|| and with W = I p , ||p|| = ||A||. 

Simulation I: Illustrations of Proposition 5.1. In the first set of simulations 
we fixed p = 50 and generated T once at the outset, giving ||r|| =8.2. The 
diagonal elements of A were 50 regularly spaced values between 1 and 101. 
According to Corollary 5.2(i), as n — > oo the variance of Ri — R should con- 
verge to var(f) = 0.595 2 , where the value was determined by substituting the 
simulation parameters into (5.2). This conclusion is supported by the results 
in Figure 1(a), where we see that the standard deviation curve A converges 
nicely to the predicted asymptotic value marked by line segment B. Curve C 
in Figure 1(a) was generated by setting T = (8.2, 0, . . . , 0) T , so span(7) is 
now a reducing subspace of p = A. In this case Corollary 5.2(h) predicts 
that Ri — R will converge to at the usual root-n rate. That conclusion is 
supported by curve C. Curves A and C in Figure 1(b) correspond to those 
in Figure 1(a), except the average correlation is plotted on the vertical axis. 
Evidently the correlations for curve A are converging to a value close to 0.9, 
while those for curve C are converging to 1. Curve A suggests that when 
var(i/) > we might still gain useful results in practice if the correlation is 
sufficiently large. 

Simulation II: Bounded and unbounded error variances. The structure of 
the second set of simulations was like the first set, except we varied n = p/2 
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(a) (b) 

Fig. 2. Results from simulation II described in Section 5.2 with n = p/2, W = I p , 
Y ~ iV(0, 1), r ~ iV(0,I p ) and diag(A) ~ Uniform(l,u). (a) Standard deviation o/Ri — R; 
(b) correlation (Ri , R) . 

and, at the start of each replication, T and Y were generated anew and the 
diagonal elements of A were sampled from the uniform distribution on the 
interval (l,it), where u = 101 or u = p + 1. The regeneration at the start of 
each replication was used to avoid tying results to a particular parameter 
configuration. For this simulation the ratio of the systematic variation in X 
to its total variation is var -1 / 2 (X) var(E(X|T, Y, A)) var~ 1 / 2 (X) = (2/(u + 
3))I p , where the moments were computed over the simulation distributions 
of X, Y, r and A. Consequently, it seems fair to characterize the signal 
as weak when u > 101, since then the systematic variation accounts for less 
than 2% of the total variation. 

The regression is abundant and k = o(l) in this simulation. Additionally, 
when u = 101, ||p|| is bounded and the results of Corollary 5.1(i) apply, 
Ri — R = O p (ra -1 / 2 ). The simulation results for this case, which are shown 
by the curves labeled ll u = 101" in Figure 2, support the claim that standard 
deviation of Ri — R is converging to and the correlation between Ri and R 
is approaching 1. On the other hand, when u = p + l, \\p\\ is unbounded and 
the results of Corollary 5.1 do not apply. However, we still might expect the 
standard deviation and correlation to converge to values away from and 1. 
The simulation results shown by the curves labeled u = p + 1 in Figure 2 
sustain this expectation. 

5.3. SPICE. Beginning with a little background, we turn in this section 
to weight matrices W E W xp selected by using SPICE. 
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Restricting attention to population weight matrices that are equal to the 
inverse of the error covariance matrix, W = A -1 , allows for the application 
of some modern regularized inverse covariance estimators with reasonable 
convergence rates of W € W xp as n,p — > oo. For example, rates of con- 
vergence have been established for banding or tapering [Bickel and Levina 
(2008a)] and element-wise thresholding [Bickel and Levina (2008b)] of the 
sample covariance matrix. However, using these approaches to estimate the 
population weight matrix is problematic. Banding/tapering the sample co- 
variance matrix is not invariant under permutations of variable labels, and 
in finite samples, both banding/tapering and thresholding the sample co- 
variance matrix may produce an estimator of the covariance matrix that is 
not positive definite. To avoid these problems, we further restrict our at- 
tention to covariance estimators that are positive definite in finite samples 
and are invariant to permutations of variable labels. Several authors have 
recently analyzed the high-dimensional inverse covariance estimator formed 
through Li-penalized likelihood optimization. We will use a particular ver- 
sion to estimate the population weight matrix (equivalently the inverse error 
covariance matrix) which was named SPICE by Rothman et al. (2008) for 
"sparse permutation invariant covariance estimator." 

Let n = n — r — 1 and let 

(5.3) A = h~ 1 Z T Q^Z G R pxp 

denote the residual sample covariance matrix from the linear regression of X 
on f . We construct W>, = A7 1 using 



Q\ = argmin 



(5.4) 



tr{diag~ 1/2 (A)Adiag- 1 / 2 (A)fi}-log|n| + A^|a j | 



A- 1 = diag- 1 / 2 (A)fi A diag- 1 /2 ( A) ) 

where A > is a tuning parameter and Ojj is the element of fi. The 

optimization in (5.4) produces a sparse estimator Q\ of the inverse error 
correlation matrix fi, which is then rescaled by the residual sample stan- 
dard deviations to give the inverse error covariance estimator A 7 . The use 
of Li-penalized likelihood optimization to estimate a sparse inverse covari- 
ance matrix has been studied extensively in the literature [d'Aspremont, 
Banerjee and El Ghaoui (2008), Yuan and Lin (2007), Friedman, Hastie and 
Tibshirani (2008), Rothman et al. (2008), Lam and Fan (2009)]. Although 
we impose sparsity on the off-diagonal entries of the inverse error covariance 
matrix A -1 , we expect that in many situations A -1 may not be sparse. We 
selected this estimator since it is able to adapt, with its tuning parameter, 
to both sparse and less sparse estimates, and can lead to a substantial re- 
duction in variability of our reduction estimator when p is large; however, 
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using small values of the tuning parameter A to give less-sparse inverse error 
covariance estimates leads to slow convergence of algorithms to solve (5.4) 
when p is large. 

Rothman et al. (2008) established the consistency of Aj^ 1 in a special case 
of model (2.2) characterized by the following four conditions: (A) r = 0, so 
A = l7 r Ljn is the usual estimator of the marginal covariance matrix of X. 

(B) The errors e are normally distributed with mean and variance A. 

(C) The largest <£> max (A) and smallest ( / 3 m ; n (A) eigenvalues of A are uni- 
formly bounded; that is, for all p 

(5.5) 0<fc<^ min (A)<v? max (A) < k < oo, 

where k and k are constants. (D) A X \/^S£. Then [Rothman et al. (2008)] 

(5.6) || Aj; 1 - A- 1 1| = O p {^n-\s(jp) + 1) logp), 

where s(p) is the total number of nonzero off-diagonal entries of A -1 , which 
could grow with p. 

In this application S = W^fW - W)W- X / 2 = A 1 / 2 (Aj^ 1 - A -1 ) A 1 / 2 . 
To find the order of ||S|| and thereby allow application of Proposition 5.1 
with SPICE, we relax the conditions of Rothman et al. (2008) by allow- 
ing (A*) r > and assuming that (B*) the elements Ej of e are uniformly 
sub- Gaussian random variables. That is, we assume there exist positive con- 
stants K\ and Ki such that for all t > and all p 

(5.7) P(| £i |>t)<K ie -^ t2 , 3 = 1,..., p. 

This assumption is compatible with universal conditions (W.l) and (W.2). 
When the £j's are normal, it is equivalent to requiring that their variances 
are uniformly bounded as p — > oo. The following lemma gives the orders 
of ||S|| and ||E(S 2 )|| required for conditions (S.l) and (S.2). 

Lemma 5.1. Let W = Aj^ 1 , let s = s(p) be the total number of nonzero 
off-diagonal entries of A. and let u> sp i cc (n,p) = y / n _1 (s + 1) logp. Under con- 
ditions (B*), (C) and (D), ||S|| = O p (u; spicc ) and ||E(S 2 )|| = 0(w 2 picc ). 

Applying Proposition 5.1 (hi) in the context of SPICE, p = I p because 
W = A -1 . Consequently, var(i/) = and tp = yfpjh^pn < k. From this we 

immediately get the following convergence rate bound for = R-spicc- 

Proposition 5.2. Assume that the conditions and notation of Lem- 
ma 5.1 hold. Then 

(5.8) R spice (XAr) - R(Xat) = Op (rc) + O p (w spice ). 
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If the number s of nonzero off-diagonal entries of A is bounded asp-> 
oo and the regression is abundant (k x n" 1 / 2 ), then R sp i cc (XAr) — R(Xjv) = 
O p {n~ 1 / 2 log 1//2 p). This allows for the number of variables p to grow much 
faster than the sample size, so long as s is bounded. On the other extreme, 
if there is no sparsity so that s =p(p— 1), then the bounding rate implied in 
Proposition 5.2 would seem to indicate that n needs to be large relative to 
{p(p — 1) + 1} logp for good results. Based on our results for normal errors in 
Section 6, we anticipate that this rate is not sharp, most notably when A -1 
is not sparse, and that it can be improved, particularly when additional 
structure is imposed (see Section 6). 

Conditions (B*), (C) and (D) required for Proposition 5.2 are in addition 
to the active universal constraints stated in Section 4. Of the universal con- 
straints, we still require the signal rate property (4.1), the order k = Oil) 
and (W.2). The remaining universal constraints are implied by the condi- 
tions of the proposition: since p = I p , the regression is stable by Lemma 4.1, 
(W.l) holds and ip < k so (4.3) holds. 

5.4. A diagonal weight matrix. We include in this section a discussion 
of the asymptotic behavior of the reduction based on the diagonal weight 
matrix W = diag -1 (A), where A was defined in (5.3). This weight matrix, 
which corresponds to the population weight matrix W = diag -1 (A), ignores 
the residual correlations and adjusts only for residual variances. However, 
in contrast to SPICE, there is no tuning parameter involved and so its 
computation is not an issue. 

With W = diag -1 (A), p is the residual correlation matrix and 

S = diag 1 / 2 (A) (diag -1 (A) - diag -1 ( A) ) diag 1 / 2 (A) . 

The following lemma gives the orders of ||S|| and ||E(S 2 )|| in preparation for 
application of Proposition 5.1. 

Lemma 5.2. Let W = diag -1 (A), let i^diag = n ~ 1 ^ 2 log 1 ^ 2 p and assume 
that the elements £j of the errors e are sub- Gaussian random variables. Then 
||S||=O p (u, diag ) and ||E(S 2 )||=G> 2 iag ). 

In contrast to Lemma 5.1, here we require only that the errors are sub- 
Gaussian and not uniformly sub-Gaussian. This is because the diagonal 
weight matrix effectively standardizes the variances. 

Proposition 5.3. Assume that the conditions and notation of Lem- 
ma 5.2 hold. Then 

(5.9) R diag (XAr) - R(Xjv) = v + O p {k) + O p (^) + O p (w diag ). 
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The order (5.9) for the diagonal weight matrix can be smaller than, equal 
to, or greater than the order (5.8) for the SPICE weight matrix, depending 
on the underlying structure of the regression. Using results from the dis- 
cussion of Section 5.1, if ||p|| = O(Vh) and span(7) is a reducing subspace 
of p, then Rdi a g(Xjv) — R(Xjy) = O p (k) + Op^^g), and thus the diagonal 
weight matrix can produce a better rate because Wdiag < w sp i ce . If A is it- 
self a diagonal matrix, then s = 0, var(iv) = 0, K>ip and Wdiag = w sp i ce , and 
consequently 

(5.10) Rdiag(X iV ) - R(X;v) = O p (n) + O p (a; diag ). 

In this case the two weight matrices result in the same order, so there seems 
to be no asymptotic loss incurred by SPICE when A is diagonal. 

6. Normal errors and £ = /3f . We consider in this section the relatively 
ideal situation in which the errors e are normally distributed and the user- 
selected basis function f correctly models the true coordinates, so £j = /3f;, 
i = 1, . . . , n, for some fixed matrix (3 G M dxr of rank d. The diagonal weight 
matrix is revisited under these assumptions in Section 6.1. In Section 6.2 we 
add the sample size constraint n> p + r + 4 so A>0. The importance of 
this setting is that we can obtain the oracle rate. 

6.1. Diagonal weight matrix. The rates (5.10) for a diagonal A can be 
sharpened considerably when the errors are normal and £ = /3f : 

Proposition 6.1. Assume that e~iV(0, A), A is a diagonal matrix, 
£ = /3f, n>r + 5 and W = diag _1 (A). Then R diag (XAr) - R(Xat) = O p (k). 

We see from this proposition that when the errors are normal, the order 
Op(^diag) that appears in (5.10) is no longer present and the rate for Rjiag 
reduces to the oracle rate k~ 1 . The condition n > r + 5 stated in Propo- 
sition 6.1 is needed to insure the existence of the variance of an inverse 
chi-squared random variable. Its proof is omitted from the supplement since 
it follows the same lines as the proof of Proposition 5.1. 

6.2. n > p + r + 4. Recalling that A £ MP xp is the residual covariance 
matrix defined in (5.3), this constraint on n means that A > with prob- 
ability 1, allowing the straightforward use of W = A as the weight ma- 
trix. The population weight matrix is W = A -1 with corresponding S = 
A 1 / 2 A -1 A 1 / 2 — I p . The asymptotic behavior of ||S|| in this setting can 
be obtained as follows: phrased in the context of this article, suppose that 
r = so A = (n - l)" 1 E" =1 (Xi - X)(X; - X) T . Johnstone (2001) [see also 
Bai (1999), page 635] showed that ^(A^^AA^ 1 / 2 ) - 1 x p (y/p/n) 
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and Paul (2005) showed that <p m i Q (A 1 / 2 AA l / 2 ) -lx p (y/p/n) when 

p/n — > a £ [0,1). Together these results imply that ||S|| >c p (y/p/n) and 
therefore ||S|| — > in probability if and only if p/n —> 0, which gives the 
order for condition (S.l). Still with r = 0, since (n — 1)A~ X is distributed 
as the inverse of a Wishart W p {A.,n — 1) matrix, we used results from von 
Rosen (1988) to verify condition (S.2). Combining these results with Propo- 
sition 5.1 we have 

(6.1) R i (X N )-K(X N ) = O p ( s /pJn), 

where R^ denotes the reduction with W = A" 1 . This suggests that it may 

be reasonable to use A -1 as the weight matrix only when n is large rela- 
tive to p, essentially sending us back to the usual requirement. In the next 
proposition we show this is not the case with r > because the order in (6.1) 
is in fact too large. 

Proposition 6.2. Assume that e~N p (0,A), £ = /3f, n>p+r+A and 
p/n— > [0,1). Let a =(n — p — l)~ 1 . Then when W = A -1 ; 

(i) h- 1 E(K) = a(n-r-l){K + K 2 I r }, 

(ii) h~ 1 (p){K-E(K)} = O p (l/^i), 
(hi) K a (X n )-K(X n ) = O p (k). 

Recall from Corollary 5.1 that when W = A -1 , we obtain the oracle rate 
Ra(Xjv) — R(Xjv) = Op (re). Comparing this result with the conclusion of 
Proposition 6.2(iii), we see that there is no impact on the rate of conver- 
gence when using W = A" 1 instead of W = A -1 , both weight matrices 
giving estimators that converge at the oracle rate. In contrast to the gross 
bound given in (6.1), Proposition 6.2 indicates that it may in fact be quite 
reasonable to use A -1 as the weight matrix when n > p + r + 4, without 
requiring n to be large relative to p. 

Properties of the normal and the accurate modeling of £ = /3f surely 
contribute to the oracle rate of Proposition 6.2. Recall from (3.1) that 
K = $y 2 B T WB$y 2 . Under the conditions of Proposition 6.2, B JL W, 
which is not required for our most general results given in Proposition 5.1. 
Additionally, we were able to use exact calculations in places where bound- 
ing was otherwise employed. The inverse residual sum of squares matrix 
(re — r — 1)A _1 is distributed as the inverse of a Wishart W p (A.,n — r — 1) 
matrix and the moments of an inverse Wishart derived by von Rosen (1988) 
were used extensively in the proof of Proposition 6.2. For instance, al- 
though A has an inverse with probability 1 when n > p+r + 1, the constraint 
on n in the hypothesis is needed to ensure the existence of the variance of 
the inverse Wishart. 
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The signal rate (4.1) and the order k = 0(1) are still required for Propo- 
sition 6.2, but its hypothesis implies all the other universal conditions stated 
in Section 4: since W = A -1 we have p = l p and thus (1) the regression is 
stable by Lemma 4.1, (2) t/; < k so (4.3) holds, (3) E(£ T We) = p so condition 
(W.l) holds, and (4) condition (W.2) holds by Lemma 4.2(i). 

7. Simulations. The effects of predictor correlations are often a concern 
when dealing with high-dimensional regressions. In this section, we introduce 
predictor correlations into our simulations to give some intuition about their 
impact on the four reduction estimators. 

The computation of the four reduction estimators Ri, R sp icc> Rdiag and 

relies on the computation of their weight matrix estimators W, which are 
available in closed form for Ri,Rdi ag and R^; however, computing the 

weight matrix estimator for R sp i cc requires us to select its tuning parame- 
ter A and to use an iterative algorithm. 

7.1. Computation and tuning parameter selection for R sp i C c • Several com- 
putational algorithms [d'Aspremont, Banerjee and El Ghaoui (2008), Yuan 
and Lin (2007), Friedman, Hastie and Tibshirani (2008), Rothman et al. 
(2008)] have been proposed to compute the solution to (5.4), of which we 
propose to use the graphical lasso (glasso) algorithm of Friedman, Hastie 
and Tibshirani (2008). We employ i'f- fold cross-validation to select the 
tuning parameter A for the weight matrix estimator = AT" 1 , where 
we minimize the validation negative log-likelihood. Specifically, we solve 
A = arg min A J2k=i 9k (A) where 

g k (\) = tr[n^A^ x) A {k ^w[- k) ] - log|W<- fc) |, 

A (M) =^)-FWb(-^fi-^. 

In the above expression, a superscript of (k) indicates that the quantity is 
based on the observations inside the kth. fold, and a superscript of (—k) 
indicates that the quantity is based on observations outside of the kth fold, 
and n k is the number of observations in the /cth fold. 

7.2. Simulation settings. The simulated data were generated from Xj = 
r£j + Si, i = 1, . . . ,n, where Y\, . . . , Y n is a sequence of independent random 
variables, e±,. .. ,e n are independent copies of e, a p-variate normal random 
vector with mean and variance A, and £j = £(Yi) is specified later. As 
in the second simulation of Section 5.2, T and A were produced anew at 
the start of each replication, with the elements of T S MP always generated 
as a sequence of independent standard normal variables and A generated 

MS 

D 1 /2@ D l/2 where D 

is a diagonal matrix with diagonal elements sam- 
pled from a uniform (1, 101) distribution and = (0j,) is a correlation ma- 
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trix with exponentially decreasing correlations, 9ij = with parameter 
< 9 < 1. This structure has been used frequently to assess the impact of 
predictor correlations on high-dimensional methodology [see, e.g., Bickel and 
Levina (2008b), Li and Yin (2008)]. For all four reduction estimators, hxp 
implying k = n" 1 / 2 . 

For each replication, we determined R.(Xjvj) and R(Xjv.j) at j = 1, . . . , 100 
new data points generated from the original model. We assess performance 
by computing the magnitude of the sample correlation between R and R 
over the 100 new data points. The results are based on 200 independent 
replications. 

For a given weight matrix W, the expected signal strength over simulation 
replications is determined by E s i m (r^Wr) =E s j m tr(W), where E s i m de- 
notes expectation over the simulations. For W = diag~ 1 (A), E s ; m [tr(W)] = 
log(101)/100, but for W = A -1 , 

E !lmN W)] = t r(e -> g (101)/100 = 2 + fr-_2)(l + ^) lo^ 

From this we see that the expected signal strength increases with 9 when 
using R^ and R sp i cc , but is constant in 6 when using Rdiag an d Ri- In 
addition, ||A|| = 0(1) since ||A|| < HAH^ = max,- JY Sij < 2 • 101/(1 - 9), 
where 6ij is element of A. 

With this general setup we next report results for various combinations 
of n, p, £ and f. We extended the applicability of R^ to regressions in 

which n < p by using the Moore-Penrose generalized inverse of A, although 
we have presented no asymptotic results for this case. 

7.3. Correctly specified £ = /3f . In this section we consider a simple case 
where £ = Y, giving d = l, and f = (Y, Y 2 , Y 3 , Y 4 ), so r = 4 and £ = /3f, 
where (3 = (1,0,0,0) and Y was generated as a standard normal variate. 

7.3.1. n = p/2 andp^-oo. In this setting, n and p grow with n = p/2. 
For the reduction estimators Ri and Rdiag) ||p|| = 0(1), ||p||f < v^llHI = 
0(y/p) which imply that ip = K = n~ 1 / 2 . Also, var(i/) < G^^AH/^ = 0(p _1 ), 
hence v = O p (p~ 1 ). Thus in this setting with n x p, Ri is -^/n-consistent 

and Rdiag is & t least \jnj log n-consistent as n,p— >oo. 

Although || A || = 0(1) and A -1 is a tri-diagonal matrix, our theoreti- 
cal bounds for R sp i C c guarantee consistency for this model only when p is 
bounded and n — > oo; however, a result established by Ravikumar et al. 
(2011), which requires additional assumptions, indicates that the weight 
matrix estimator for R sp icc is consistent when A -1 is tri-diagonal. 

The results for p and n growing with n = p/2 are shown in Figure 3(a)- 
(c). All reduction estimators appear to be converging to the population 
reduction as n,p— >■ oo, even though consistency is not guaranteed in this 
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setting for R sp i C c an d R^, which uses a Moore-Penrose generalized inverse 

of the residual sample covariance matrix. Interestingly, when p > n, R^ 

outperforms Rdiag an d Ri when 9 > 0.9. Results for R sp i C c were computed up 
to p = 400 when 9 < 0.9 and up to p = 100 when 9 = 0.99 due to intractable 
computation time required for the glasso algorithm. In scenarios when R S pi C c 
was computed, it outperformed the other reduction estimators, particularly 
when 9 = 0.9. As expected, larger values of 9 lead to favorable performance 
for the reduction estimators with population weight matrix W = A -1 . 

7.3.2. p = 100 and n—too. In this setting we fix p = 100 and let n grow. 
Our theory guarantees that R sp i C e and R^ are both y^-consistent. On the 
other hand, Ri and Rdiag are inconsistent since span (7) is not a reducing 
subspace of p and p is bounded, implying v fails to vanish. 
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The results for p = 100 and n growing are illustrated in Figure 4(a)-(c). 
As our theory suggests, the reduction estimators R sp icc and R^ appear to 
be converging to the population reduction as n increases. We see that R sp icc 
outperformed the other reduction estimators, particularly for relatively 
small n when = 0.9. As expected, R^ outperforms Rdiag an d Ri when n 
is much larger than p or when 9 is large. 

7.4. Results for £ ^ /3f . In this section we present results for a misspeci- 
fied £ using £ = var~ 1 / 2 (exp(y)) [exp(Y) - E(exp(Y))] where Y ~ Unif (0, 4). 
Holding n = 50 and p = 100, we varied f = (y, y 2 , . . . , y k ) T for k = 1, . . . , 5 
and f = exp(y). The results are summarized in Figure 5 as boxplots of the 
correlation magnitudes over the 200 replications. Most striking is how little 
the choice of f seems to affect the estimators. This is likely because there 
are fairly strong correlations between £ and its approximations provided by 
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Fig. 5. Estimators of R when f is misspecified, using exponential error correlations: 
Oij — 0.5' l ~-'', n — 50 and p= 100. Boxplots are labeled by the highest order term m f. 
Here (a) R sp ic C ; (b) R d ia g ; (c) R s , (d) Ri. 



the six f's used in the simulation, which satisfies condition (3.3). The rela- 
tionship between the estimators are similar to those in Figure 4 at n = 50, 
regardless of the choice of f . 

8. Spectroscopy application. To illustrate operation of the proposed meth- 
odology, we examine data from the potential application area mentioned in 
the Introduction. The response variable is the percentage of fat in samples of 
beef or pork, and the predictors are the absorbance spectra (log(l/i?)) from 
near-infrared transmittance for fat measured at every second wavelength 
between 850 and 1050 nm, giving p = 100 with n = 54 [Sasb0 et al. (2007)]. 
The goal of the study is to predict the response from the absorbance spectra 
of a new sample. Our predictive framework is described in the next section. 
We return to the spectroscopy application in Section 8.2. 
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8.1. Prediction. We predict an unobserved response Yv associated with 
a new observed vector of predictors X^r by using the forward regression mean 
function: l^-ed = E{Y|Xat} = E{Y|R(Xtv)}- However, the reduction R(X) 
was based on the inverse regression X|Y and the development did not pro- 
duce a direct estimator of E{Y|R(Xtv)}- There are perhaps several ways of 
using an estimated reduction for predicting a new response. Some authors 
have used standard data-analytic methods to develop predictive models 
based on Y and the estimated reduction, and there are a variety of nonpara- 
metric regression methods that could also be used as well. We follow Adragni 
and Cook (2009) and use a kernel-type estimator of E{Y|R(Xtv)} based on 
the relationship E{Y|X = x} = E{Y|R(x)} = E{Yg(R(x)|Y)}/E{ 9 (R(x)|Y)}, 
where g is the conditional density of R|Y. This provides a method to esti- 
mate E{Y|X}: 

n 

E{Y|X = x} = ^^(x)Y, 

i=l 

(8.1) 

E?=i5(R(x)|Yi) 

where g denotes an estimate of the density and R is the estimated reduc- 
tion. This estimator is reminiscent of a nonparametric kernel estimator, but 
there are consequential differences. The weights in a kernel estimator do 
not depend on the response, while the weights Wi here do. Kernel weights 
typically depend on the full vector of predictors X, while the weights here 
depend on X only through the estimated reduction R(x). Multivariate ker- 
nels are usually taken to be the product of univariate kernels, corresponding 
here to treating the components of R as independent. Finally, there is no 
need for bandwidth estimation because the weights are determined entirely 
from g. 

When d is small relative to p it may often be reasonable to assume 
that R(X)|Y is normally distributed, which seems appropriate for the spec- 
troscopy data. Ignoring constants not depending on i, we have 

5 (R(x)|Y) «exp{-(l/2)(R(x) -^) T (r r A- 1 T)(R(x) -^)}. 

Substituting the estimators T, bfj and W for T, £j and A -1 gives the 
weights required for (8.1). For Rj, we set W = (p/ tr(A))I p . 

8.2. Spectroscopy. Since there are only p = 100 predictors it was straight- 
forward, albeit somewhat tedious, to inspect inverse response plots of Xj ver- 
sus Y, j = 1, . . . , 100, to gain information about the likely structure of f . We 
performed two analyses; the first consisted of 54 pork samples and the second 
consisted of 103 meat samples of beef and pork. For a specified value of d, we 
assessed the fitted model by using the residual mean square RMS(d, r, R(.) ) = 
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Ya=iO^ ~ Yi) 2 / n i where the fitted values Yj = E(Y|X = Xj) were deter- 
mined using (8.1) with the indicated combination of d, r and reduction R(.)- 
We selected the value of d by adapting the permutation scenario developed 
by Cook and Yin (2001). The hypothesis d = do was tested sequentially 
starting at do = and estimating d as the first hypothesized value that was 
not rejected. The test statistic RMS(do + l,r,R(,)) was compared to the 
distribution of RMS((io + 1, r, R-(.)) induced by 1000 random permutations J 
applied to the rows of the predictor matrix X as follows: 

X perm = XPr W)+ JXQ^ ) , 

where T and W were computed under the null hypothesis. This scheme 
leaves the signal XPp intact while permuting the uninformative part of 

the predictors XQ~^^. 

The multicollinearity of the predictors made computing the weight matrix 
estimator for R S pi C0 difficult for small values of its tuning parameter, values 
for which our cross-validation procedure recommended. We subsequently 
set its tuning parameter to A = for both analyses, since this was the 
smallest value for which a numerically stable solution was available. 



8.2.1. Analysis of pork samples. In this case we concluded that a cubic 
polynomial f(y) = {y,y 2 ,y 3 ) T would be adequate; a representative plot is 
shown in Figure 6(a). Performing the permutation test to select d, the test 
statistic for d = using R^ was RMS(1,3,R^) =0.31 which was smaller 



2.90 



2.85 - 




2.70 



14 



28 



I 

35 



2.90 



CO 


2.85 - 




-C 

Ol 






c 











2.80 - 








*-+-- 








o 
o 


2.75 - 





2.65 



I 

14 



21 



28 



35 



Percent Fat, Y 

(a) 



Percent Fat, Y 



(b) 



Fig. 6. Inverse response plot of the absorbance at wavelength 854 versus Y for (a) pork 
samples and (b) pork and beef samples. The line for the pork sample is a fitted cubic 
polynomial. For pork and beef the lines represent a second-order polynomial fit to the data 
with one intercept for pork (solid) and a second intercept for beef (dashes). 
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than the smallest value 10.7 of RMS(1,3,R^) observed among the 1000 
random permutations under the hypothesis that d = 0. Since the test statis- 
tic is much smaller than can be accounted for by chance under the null 
hypothesis, we concluded that d > 1. Similarly, to test d = l, we observed 
RMS(2, 3, R^) = 0.29 which fell at the 80th quantile of the permutation dis- 
tribution of RMS (2, 3, R^) under the hypothesis. Consequently, we used d = 
1 for the model. Comparisons with other values of r figured in our choice r = 
3. Cross-validation using RMS as the criterion might also be used to select d. 

Fitting each of the four estimators with d = l, we found that RMS(1,3, 
R) = 21.55, 5.15, 0.31 and 2.15 for R = R I} R diag , R A and R S pi cc . Plots of Y 
versus Y are shown in Figure 7(a)— (d). The predictors in this illustration are 
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Fig. 7. Plots of the response versus fitted values for the four estimators (a) R.^, 
(b) R S pi cc , (c) Rdiag and (d) Ri, using the pork samples. 
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highly collinear, as is typical in spectral data, and the relative performance 
of the four estimators is qualitatively similar to that shown in previous sim- 
ulations; however, numerical instability degraded the performance of R sp i cc . 
The relative signal rates for the four estimators are reflected by the values 

of f T wf = 35.3, 128.4, 169.7 and 63.38 for W = (p/tr(A))I d) diag _1 (A), 
A" and Ar 1 . 

A 

8.2.2. Analysis of both pork and beef samples. In this case we concluded 
that a second-order polynomial and the indicator function of beef would 
be adequate, f(y) = (y, y 2 , J(beef)) T ; a representative plot is shown in Fig- 
ure 6(b). Using the permutation test approach to select d, the test statistic 
for d = using R A was RMS(1, 3, R^) = 0.55 which fell at the 0.003 quan- 

tile of RMS(1, 3, R^) observed among the 1000 random permutations under 
the hypothesis that d = 0. Since the test statistic is much smaller than can 
be accounted for by chance under the null hypothesis, we concluded that 
d > 1. Similarly, to test d = 1, we observed RMS(2, 3, R^) = 0.01 which fell 

at the 0.70 quantile of the permutation distribution of RMS(2, 3, R^) under 
the hypothesis. Consequently, we used d = 1 for the model. 

Fitting each of the four estimators with d=l, we found that RMS(1,3, 
R) = 41.96, 41.65, 0.55 and 4.66 for R = Ri, R diag , R^ and R S p ice . Plots 
of Y versus Y are shown in Figure 8(a)-(d). The relative signal rates for the 
four estimators are reflected by the values of T T Wr = 18.4, 59.2, 3346.6 
and 69.3 for W = (p/tr(A))I d , diag _1 (A), A" 1 and Ar 1 . The relatively 

A 

large signal for A" 1 is reflected in the plots of Figure 8. 

9. Discussion. The class of estimators that we studied is limited in scope 
relative to the range of SDR methods presently available for p = o(n) regres- 
sions. However, in the broader context of this article, which does not require 
p = o(n), we introduced the concept of an abundant regression and presented 
what may be the first n,p asymptotic analysis of a class of SDR methods 
when the focus is on estimating a reduction R rather than on the under- 
lying central subspace. These ideas can in principle be extended for other 
SDR methods, and we expect that the same general issues will be encoun- 
tered. While each of the methods that we studied can perform usefully in 
the right situation, we judge the SPICE and A - weight matrices to be the 
best overall, although improvements for nonsparse weight matrices and for 
regressions with very highly correlated predictors are still needed. 

Our simulation results were all based on normal errors to focus the presen- 
tation and save space. However, we have also conducted a variety of parallel 
simulations using Uniform (0, 1), T§ and %| errors, each centered and appro- 
priately scaled. The results were essentially the same as those with normal 
errors [Cook, Forzani and Rothman (2012)]. 
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Fig. 8. Plots of the response versus fitted values for the four estimators 



R; 



(b) R S pi cc , (c) Rdiag and (d) Ri. Circles represent pork and plus signs represent beef. 



9.1. Penalty functions. Alternative penalty functions may be used for 
estimating the weight matrix, particularly in scenarios when the inverse error 
covariance matrix is not sparse. For example, the sparse-seeking penalty 
function in (5.4), , ■ could be replaced with the quadratic penalty 
function, 

(9.1) *(e^+"XX-). 

Wi i=i / 

where a G {0, 1} controls whether or not the diagonal of fi, the inverse 
error correlation matrix, is penalized. If a = 0, the general SPICE algo- 
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rithm developed by Rothman et al. (2008) can efficiently solve (5.4) with 
the penalty defined in (9.1). If a = 1, Witten and Tibshirani (2009) de- 
rived an noniterative solution to an equivalent problem to (5.4) with the 
penalty defined in (9.1). Recalling that <fj(A.) denotes the jth eigenvalue 
of a matrix A, let r]j = v?j(diag -1 / 2 (A) A diag~ 1 / 2 (A)). Witten and Tibshi- 
rani showed that the eigenvectors of Q\ are equivalent to the eigenvectors 
of diag^ 1 / 2 (A)Adiag- 1/2 (A) and that ^(O a ) = (4A) _1 {(»$ + 8A) 1 / 2 - rjj}. 

9.2. Choice off. The general rates given in Proposition 5.1 are not very 
sensitive to the choice of f since they hold when f satisfies the minimal 
rank condition (3.3). Nevertheless, assuming normality and a correct f, we 
obtained the oracle rates of Proposition 6.2, which indicates that there are 
advantages to pursuing good choices. The methods sketched in Section 3.2 
are often useful in practice, but it is also possible to develop semiparamet- 
ric methods to estimate £ directly rather than passing through approxima- 
tions (3f. This might be accomplished iteratively: choose an initial f and 
construct the corresponding estimates T 1 , £ x = /3 1 f and R|^- A new esti- 
mate of £ can be obtained by smoothing the coordinates of Ri^ against Y, 
leading to a second reduction estimate R 2 ^- The process can now be con- 
tinued until some convergence criterion is met. 

9.3. Variable selection. While we did not incorporate screening or vari- 
able selection into our reduction methodology, the potential benefits of 
those procedures are manifested in our results. Consider, for instance, a re- 
gression in which p 2 / 3 of the predictors are inactive. Then the oracle rate 
k _1 = (p//in) _1//2 = n 1 / 2 ^ -2 / 3 . However, if we remove p 1 / 3 of the inactive 
predictors, the oracle rate is increased to k" 1 = n 1 / 2 ^" 1 / 3 , which should be 
worthwhile in most applications. Work along these lines is in progress. 

Acknowledgments. The authors are grateful to the referees whose helpful 
comments led to significant improvements in this article. 

SUPPLEMENTARY MATERIAL 

Supplement to "Estimating sufficient reductions of the predictors in abun- 
dant high-dimensional regressions" (DOI: 10.1214/11-AOS962SUPP; .pdf). 
Owing to space constraints, we have placed the technical proofs in a supple- 
mental article [Cook, Forzani and Rothman (2012)]. The supplement also 
contains several preparatory technical results that may be of interest in their 
own right and additional simulations. For instance, we gave in Section 7 sim- 
ulation results from models with exponentially decreasing error correlations. 
In the supplemental article we give parallel results based on the same models 
but with constant error correlations. 
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